Car Accident Severity Prediction


1. Introduction & Business Problem Understanding

1.1. Problem Background

Today, the number of people killed in road crashes around the world continues to increase. According to the World Health Organisation’s “Global Status Report on Road Safety”, it reached 1.35 million in 2016 alone. This means that, worldwide, more people die as a result of road traffic injuries than from HIV/AIDS, tuberculosis or diarrhoeal diseases. And road crashes are now the most common cause of death for children and young people between 5 and 29 worldwide.

Even the significant improvement in road safety achieved in "developed" countries over the last decades (especially for severe accidents, in terms of human fatality, traffic delay and property damage) has not eliminated the phenomenon, and the mindset required for "Vision Zero" remains at stake: in Europe alone, car accidents kill every week as many people as fit into a jumbo jet, and just as we do not accept deaths in the air, we should no longer accept them on the road.

1.2. Approaching the Problem

The City of Seattle Government has long aligned its road safety policies with the framework and the targets UN and WHO have adopted approaching the issue in terms of "Safe System". The core elements of this approach are ensuring safe vehicles, safe infrastructure, safe road use (speed, sober driving, wearing safety belts and helmets) and better post-crash care.

But the end of the UN "Decade of Action for Road Safety" (2010-2020) coincides with a moment when the tools of modern data science can be added to this systematic approach. Could we exploit well-established algorithms and available data for an extra, case-tailored, preventive approach that warns both drivers and traffic services about the possibility of a car accident, given a set of objective and subjective conditions and factors? Could we predict how severe such an accident would be, so that a driver drives more carefully or changes route, and traffic services prepare their response? This potentially promising contribution of machine learning to the global debate on road safety is the object of this project, in which the long-established elements of the Safe System approach will be used as predictors for a machine learning model able to predict accident "severity".

1.3. Problem Definition and Stakeholders

The Seattle Department of Transportation (SDOT: the municipal government agency responsible for maintaining the city's transportation systems, including roads, bridges, and public transportation) has asked us, on behalf of its Response Team as well as the Seattle Police Department (SPD), to build an ML model that will help them better perceive and predict the risk of a severe road accident, given quantitative or qualitative knowledge of the conditions and circumstances of the municipal road network.

It should be noted that this is not a study which attempts to link, with an ML approach, the grid of all the causes and effects of a car accident. Rather, it is a study that attempts to connect specifically:

  • exactly one effect: the severity (human injury) or not (property damage only) of an accident with
  • a range of causes / factors contributing to the accident, in relation to which the SDOT, its Response Team and the SPD can intervene (by improving or warning) and be prepared (to deploy their response resources in the most appropriate and effective way)

2. Data

2.1. Source of the Data

The dataset we will rely on has been updated weekly by the SDOT Traffic Management Division; its data come from the Seattle Police Department Traffic Records and cover all types of collisions from 2004 to May 2020.

2.2. Description of the data

The dataset is rich, containing many observations (rows) and various attributes (columns). Before wrangling, there are 194,673 observations, most of which are usable for training and testing the machine learning model. Of course, this does not mean we can skip data cleaning and tidying, or data balancing (otherwise we would create a biased ML model), since, as expected, severe accidents are significantly less frequent than non-severe ones (specifically, slightly less than 1/3 of the total).

Attribute Description
SEVERITYCODE A code that corresponds to the severity of the collision:
  • 1-property damage
  • 2-human injury
X Accident Location's Longitude
Y Accident Location's Latitude
OBJECTID no description
INCKEY no description
COLDETKEY no description
REPORTNO no description
STATUS no description
ADRESSTYPE no description
INTKEY no description
LOCATION Description of the general location of the collision
EXCEPTRSNCODE no description
SEVERITYCODE Repeat of 1st column (label/target)
EXCEPTRSNDESC no description
SEVERITYDESC A detailed description of the severity of the collision
COLLISIONTYPE Collision type
PERSONCOUNT The total number of people involved in the collision
PEDCOUNT The number of pedestrians involved in the collision. This is entered by the state
PEDCYLCOUNT The number of bicycles involved in the collision. This is entered by the state
VEHCOUNT The number of vehicles involved in the collision. This is entered by the state
INCDATE The date of the incident
INCDTTM The date and time of the incident
JUNCTIONTYPE Category of junction at which collision took place
SDOT_COLCODE A code given to the collision by SDOT
SDOT_COLDESC A description of the collision corresponding to the collision code
INATTENTIONIND Whether or not collision was due to inattention (Y/N)
UNDERINFL Whether or not a driver involved was under the influence of drugs or alcohol
WEATHER A description of the weather conditions during the time of the collision
ROADCOND The condition of the road during the collision
LIGHTCOND The light conditions during the collision
PEDROWNOTGRNT Whether or not the pedestrian right of way was not granted (Y/N)
SDOTCOLNUM A number given to the collision by SDOT
SPEEDING Whether or not speeding was a factor in the collision (Y/N)
ST_COLCODE A code provided by the state that describes the collision
ST_COLDESC A description that corresponds to the state’s coding designation
SEGLANEKEY A key for the lane segment in which the collision occurred
CROSSWALKKEY A key for the crosswalk at which the collision occurred
HITPARKEDCAR Whether or not the collision involved hitting a parked car (Y/N)
Metadata source

Our dataset's first column is the labeled data, describing the severity of an accident/collision; it takes only two values, corresponding to "severe" and "not severe" (a binary classification problem). Of the remaining 36 columns of the dataset (with either numerical or categorical types of data), not all are useful for building our classifier:

  • Several (in italics in the above table) are keys, state designation codes or rather "useless" long textual descriptions
  • A few others (regular text) could be very useful in another type of problem and model building, where we would be interested in a more descriptive analytical approach covering several aspects of a car accident (approached as a given event rather than as a (probable) event)
  • The remaining ones (in boldface: location coordinates, date, time, junction type, inattention involved, driver under drugs/alcohol influence, weather, road and light conditions, speeding involved, "pedestrian right of way" not granted) could be the features that, along with some feature engineering, we will use to train the model.

2.3. Data Preparation

Let's first see if any of the selected variables have a significantly high number of missing values (converted into NaN values during csv's import to DataFrame format):
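This check could be sketched as below, on a toy stand-in for the collisions DataFrame (the real `df` is read from the SDOT csv; the toy values are purely illustrative):

```python
import pandas as pd
import numpy as np

# Toy stand-in for the collisions DataFrame (the real one is read from the SDOT csv)
df = pd.DataFrame({
    "SPEEDING": ["Y", np.nan, np.nan, np.nan],
    "WEATHER": ["Clear", "Raining", np.nan, "Clear"],
})

# Percentage of missing values per column, sorted descending
missing_pct = (df.isna().mean() * 100).round(2).sort_values(ascending=False)
print(missing_pct)
```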

From the above we can easily conclude that there is no interest in using three more variables:

  • INATTENTIONIND: whether or not collision was due to inattention (Y/N),
  • PEDROWNOTGRNT: whether or not the pedestrian right of way was not granted (Y/N),
  • SPEEDING: whether or not speeding was a factor in the collision (Y/N)

as the percentage of observations without such values is respectively 84.69%, 97.60% and 95.21%.

Having dealt with the issue of variables with significantly high number of missing values, we should also deal with the missing values (NaN and 'Unknown') of the remaining ones.

  1. For the NaN values, we simply drop all observations with at least one missing value, given their relatively small number (the union set of these observations is 14,587).
  2. For the "Unknown" values, we proceed with the same approach, deleting them (the union set of these observations is 14,883):
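The two deletions above could be sketched as follows (toy `df`; the real column set is larger):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "WEATHER": ["Clear", "Unknown", np.nan, "Raining"],
    "ROADCOND": ["Dry", "Dry", "Wet", "Unknown"],
})

# 1. Drop every observation with at least one NaN value
df = df.dropna()

# 2. Drop every observation containing the placeholder string "Unknown"
df = df[~df.isin(["Unknown"]).any(axis=1)].reset_index(drop=True)
```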

Last but not least, let's check the unique values of each retained variable, to confirm we have ended up with a clean dataset as concerns both the numerical and categorical variables we will use for our model building:

It looks like we have to deal with two more issues:

  1. the "Other" category of WEATHER, ROADCOND, LIGHTCOND (477 obs) and
  2. the four categories ("N", "0", "1", "Y") of UNDERINFL which apparently should be grouped in only two

Finally, before proceeding to our Exploratory Data Analysis and Model Building, let's convert INCDTTM into MONTH, WEEKDAY, and HOUR:
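A sketch of this conversion with pandas (the date strings here are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"INCDTTM": ["3/27/2013 2:54:00 PM", "11/1/2006 7:15:00 AM"]})

dt = pd.to_datetime(df["INCDTTM"])
df["MONTH"] = dt.dt.month      # 1-12
df["WEEKDAY"] = dt.dt.weekday  # 0 = Monday ... 6 = Sunday
df["HOUR"] = dt.dt.hour        # 0-23
```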

So we can now say that we have completed our data preparation tasks (cleansing and transforming raw data). From here on, any further transformation of the data will be driven by the needs of the exploratory analysis and by the feature selection/extraction and data preprocessing for the model building itself.


3. Methodology

In this project we will direct our efforts toward building a good classifier with which to predict the severity (involving human injury) of a car accident, given some conditions and space and time coordinates.

In the first step we collected, cleaned and prepared our data, getting rid of variables whose interest lies outside the scope of this project.

The second step is our exploratory analysis, mainly based on bivariate and correlation analysis, investigating the relationships between our variables. Here we will try to understand which of them would be the best features for our model, which should be transformed for that purpose, and which should be considered redundant. Next we will work on the spatial aspect of the problem: from the simple X-Y accident coordinates we will try to extract other spatial features, useful for the model and practical for the stakeholders (SDOT, SPD). After that, we will take care of preprocessing the dataset so that it can be digested by the algorithms, balance our severely imbalanced dataset, and perform the train-test split.

Finally, we will build and optimize our classification model, based on four algorithms:

  • K Nearest Neighbor (KNN)
  • Decision Tree
  • Support Vector Machine
  • Logistic Regression

To do so, we will test different values of their hyperparameters and in each case keep the ones with the best performance, so as to finally evaluate the algorithms against each other on the hold-out test dataset.

4. Analysis

4.1. Exploratory Data Analysis

DESCRIPTIVE STATISTICS

Let's start with a basic overview of our variables' descriptive statistics and types:

SEVERITYCODE X Y JUNCTIONTYPE UNDERINFL WEATHER ROADCOND LIGHTCOND WEEKDAY MONTH HOUR
count 164726 164726 164726 164726 164726 164726 164726 164726 164726 164726 164726
unique NaN NaN NaN 6 NaN 9 7 7 NaN NaN NaN
top NaN NaN NaN Mid-Block (not related to intersection) NaN Clear Dry Daylight NaN NaN NaN
freq NaN NaN NaN 72972 NaN 105761 118066 109379 NaN NaN NaN
mean 1.33111 -122.33 47.619 NaN 0.0528332 NaN NaN NaN 2.93985 6.54467 11.4907
std 0.470615 0.0297912 0.0566854 NaN 0.223701 NaN NaN NaN 1.91835 3.40742 6.88798
min 1 -122.419 47.4956 NaN 0 NaN NaN NaN 0 1 0
25% 1 -122.348 47.5743 NaN 0 NaN NaN NaN 1 4 7
50% 1 -122.33 47.6148 NaN 0 NaN NaN NaN 3 7 13
75% 2 -122.312 47.6637 NaN 0 NaN NaN NaN 5 10 17
max 2 -122.239 47.7341 NaN 1 NaN NaN NaN 6 12 23
type int64 float64 float64 object int64 object object object int64 int64 int64

BIVARIATE ANALYSIS

From the plot alone, it seems the WEEKDAY variable, at least as it is, will not play a major role in our models. So we already transform it from a 7-category variable into a binary one, weekdays (Mon-Thu): 0 and long weekend (Fri-Sun): 1, and we replot:

As previously, the plot suggests the MONTH variable, at least as it is, will not play a major role in our models. So we already transform it from a 12-category variable into a 4-category one, Winter (1, 2, 12), Spring (3-5), Summer (6-8) and Autumn (9-11), and we replot:

As previously, the plot suggests the HOUR variable, at least as it is, will not play a major role in our models. So we already transform it from a 24-category variable into a 4-category DAY_PERIOD variable, grouping the hours into periods of the day, and we replot:
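The three recodings could be sketched as below; the exact cut points used for DAY_PERIOD are an assumption of this sketch, not taken from the project notebook:

```python
import pandas as pd

df = pd.DataFrame({"WEEKDAY": [0, 4, 6], "MONTH": [1, 7, 10], "HOUR": [3, 14, 22]})

# WEEKDAY: Mon-Thu -> 0, Fri-Sun ("long weekend") -> 1
df["WEEK_DAY/END"] = (df["WEEKDAY"] >= 4).astype(int)

# MONTH -> season: Winter(12,1,2), Spring(3-5), Summer(6-8), Autumn(9-11)
df["SEASON"] = df["MONTH"].map(
    lambda m: "Winter" if m in (12, 1, 2)
    else "Spring" if m <= 5 else "Summer" if m <= 8 else "Autumn")

# HOUR -> period of day (these cut points are an assumption of this sketch)
df["DAY_PERIOD"] = pd.cut(df["HOUR"], bins=[-1, 5, 11, 17, 23],
                          labels=["Night", "Morning", "Afternoon", "Evening"])
```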

CRAMÉR'S V CORRELATION

With our extended df, we will investigate the correlation of our variables with each other and with the target variable. But since the majority of our data (always excluding the X and Y coordinates) is categorical, we will use Cramér's V correlation and its heatmap.
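Cramér's V is not built into pandas or scipy directly, but it can be derived from the chi-squared statistic; a minimal implementation (without the bias correction some authors apply) might look like:

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(x, y):
    """Cramér's V association between two categorical series (0 = none, 1 = perfect)."""
    confusion = pd.crosstab(x, y)
    chi2 = chi2_contingency(confusion)[0]
    n = confusion.values.sum()
    r, k = confusion.shape
    return np.sqrt((chi2 / n) / min(r - 1, k - 1))

# Two perfectly associated variables should give V = 1
a = pd.Series(["x", "x", "y", "y", "z", "z"])
b = pd.Series(["u", "u", "v", "v", "w", "w"])
```

The heatmap can then be built by applying this function to every pair of columns, e.g. with seaborn's `heatmap`.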

From the above, we can safely conclude that we have to choose:

  • only one between HOUR and DAY_PERIOD whose correlation is 1,
  • only one as well between MONTH and SEASON whose correlation is 1, and
  • only one between WEEK_DAY/END and WEEKDAY for the same reason.

We decide to go, respectively, for DAY_PERIOD, MONTH and WEEKDAY, instead of HOUR, SEASON and WEEK_DAY/END. In all cases, we based our choice on which variable has the lower average Cramér's V correlation with the rest of the variables (we would also have used the criterion of which variable has the higher Cramér's V correlation with the target variable SEVERITYCODE, but in our case all six pairs' values are 0).

  • DAY_PERIOD: 0.17/9 < HOUR: 0.24/9
  • MONTH: 0.04/9 < SEASON: 0.08/9
  • WEEKDAY: 0.03/9 < WEEK_DAY/END: 0.07/9

4.2. Spatial Feature Extraction

So it is time to deal with the spatial parameter of our problem. There are four ways to integrate the data of columns X and Y into the final table with which we will train our model:

  1. Not changing anything, and using these features as they are,
  2. With a clustering approach, replacing the accidents' coordinates with the cluster that each one is placed,
  3. With a binning approach, creating a grid map: cutting the map into 2-D bins and replacing each accident's geographic coordinates with its corresponding grid zone,
  4. With a binning approach based on distance from the city center.

1. The first option, not changing anything and feeding our model the geographical coordinates of each accident, doesn't seem very efficient or helpful for the practical needs of the SDOT and SPD forces that will be overseeing the roads of the area and alerting for response wherever necessary. It is much more practical for a service to work with areas of related characteristics (areas with a higher or lower probability of a serious accident, requiring a specific level of protective and precautionary measures and vigilance), distinct from their neighbors. Grouping this way, we may lose in precision (since we reason in terms of the "average"), but we gain in abstraction, as we have to worry about fewer details; and nothing guarantees we would even lose in accuracy. Moreover, training the algorithm with such finely differentiated observations would very likely lead to over-fitting.

2. Regarding the clustering approach, it would be interesting in our case to try two different clustering techniques:

  • a Density-Based Spatial Clustering of Applications with Noise (DBSCAN), based on location and SEVERITYCODE, and
  • a K-means Clustering, based again on location and SEVERITYCODE.

But as we can see in our Appendix, neither method leads to results that contribute to the practicality aspect of our project, as they:

  • create very few, very small clusters, with the majority of accidents left unclustered as noise (DBSCAN),
  • or, with different parameters, recognise very little noise along with a multitude of very small clusters (DBSCAN),
  • produce fewer or more, but definitely overlapping, clusters (K-means),
  • produce clusters whose shape, especially combined with the very large and continuous concentration of accidents, does not serve the practicality goals of our project.

3. 2-D Location Binning:

Simpler and, in our case, more efficient and usable than the previous approaches is to create a grid map based on meridians and parallels (this is why the grid, in Mercator projection, appears tilted). We can even calculate the average SEVERITYCODE, practically the average probability of a severe car accident, for each one of the grid's blocks.
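A sketch of the 2-D binning on toy coordinates (the real grid would use many more bins; bin counts and labels here are assumptions):

```python
import pandas as pd

df = pd.DataFrame({
    "X": [-122.40, -122.30, -122.25, -122.41],
    "Y": [47.50, 47.60, 47.70, 47.51],
    "SEVERITYCODE": [1, 2, 1, 2],
})

# Cut longitude and latitude into equal-width bins and combine them into one grid label
df["X_bin"] = pd.cut(df["X"], bins=3)
df["Y_bin"] = pd.cut(df["Y"], bins=3)
df["XY_bin"] = df["X_bin"].astype(str) + " / " + df["Y_bin"].astype(str)

# Mean SEVERITYCODE per grid cell; (mean - 1) reads as an empirical severe-accident rate
severity_by_cell = df.groupby("XY_bin")["SEVERITYCODE"].mean() - 1
```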


4. Distance from city center

Another approach is to group the accidents by distance from Seattle's city center, and calculate for this case as well, the average SEVERITYCODE (AVG(SEVERITYCODE)-1 ~ Severity Probability):

DISTANCE_bin (km) SEVERITYCODE=1 (share) SEVERITYCODE=2 (share)
0.0-2.5 0.685798 0.314202
2.5-5.0 0.680294 0.319706
5.0-7.5 0.676166 0.323834
7.5-10.0 0.659823 0.340177
10.0-12.5 0.630911 0.369089
12.5-15.0 0.607610 0.392390
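The distance grouping above could be sketched as below; the haversine formula and the exact city-center reference point are assumptions of this sketch:

```python
import numpy as np
import pandas as pd

# Seattle city-center reference point used by this sketch (an assumption)
CENTER_LAT, CENTER_LON = 47.6062, -122.3321

def haversine_km(lat, lon, lat0=CENTER_LAT, lon0=CENTER_LON):
    """Great-circle distance in km from (lat, lon) to the reference point."""
    lat, lon, lat0, lon0 = map(np.radians, [lat, lon, lat0, lon0])
    a = (np.sin((lat - lat0) / 2) ** 2
         + np.cos(lat0) * np.cos(lat) * np.sin((lon - lon0) / 2) ** 2)
    return 2 * 6371 * np.arcsin(np.sqrt(a))

df = pd.DataFrame({"Y": [47.6062, 47.70], "X": [-122.3321, -122.25]})
df["DISTANCE"] = haversine_km(df["Y"], df["X"])
df["DISTANCE_bin"] = pd.cut(df["DISTANCE"],
                            bins=[0, 2.5, 5, 7.5, 10, 12.5, 15],
                            include_lowest=True)
```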

So, let's see for a first time what the correlation between the two new spatial features is. To do so, we rely once more on Cramér's V correlation:

From the above figure, an indisputable correlation between these two features becomes apparent. So, arriving at the model-building stage, we will not at first use the two features at the same time: we will first train the models including only one of XY_bin and DISTANCE_bin. And since our primary goal is both to make predictions and to better understand the role of each independent variable, we will then try both at the same time, keeping in mind that collinearity affects the coefficients and p-values, but does not influence the predictions, the precision of the predictions, or the goodness-of-fit statistics.

4.3. Data Preprocessing

Dataset Balancing

The challenge of working with imbalanced datasets is that most machine learning techniques will ignore, and in turn have poor performance on, the minority class, although typically it is performance on the minority class that is most important.

One approach to addressing imbalanced datasets is to oversample the minority class, with techniques ranging from simple ones to more elaborate ones such as SMOTE. In our case we will proceed by simply downsampling the majority class (SEVERITYCODE=1):
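The downsampling could be sketched as below, on toy data:

```python
import pandas as pd

df = pd.DataFrame({"SEVERITYCODE": [1] * 8 + [2] * 3, "FEAT": range(11)})

majority = df[df["SEVERITYCODE"] == 1]
minority = df[df["SEVERITYCODE"] == 2]

# Randomly downsample the majority class to the size of the minority class, then shuffle
majority_down = majority.sample(n=len(minority), random_state=42)
balanced = pd.concat([majority_down, minority]).sample(frac=1, random_state=42)
```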

Train-Test Data Split

Before starting the model building, let's randomly select a 5% fraction of our dataset and hold it out, to eventually use as the test dataset for our different ML algorithms.
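The hold-out could be sketched with pandas' `sample`/`drop` (toy data):

```python
import pandas as pd

df = pd.DataFrame({"FEAT": range(100), "SEVERITYCODE": [1, 2] * 50})

# Hold out a random 5% fraction as the final test set
test = df.sample(frac=0.05, random_state=0)
train = df.drop(test.index)
```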

Encoding

We continue by encoding the variables that need it, choosing One-Hot-Encoding for some of them.

We will simply drop all the features that count fewer than 1/1000 occurrences in our dataset.

So, we end up with 26 features
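The encoding and the rare-dummy drop could be sketched as below, on toy data sized so that the 1/1000 threshold actually removes something:

```python
import pandas as pd

# 2000 toy observations: "Sleet" appears only once, i.e. below 1/1000 of the rows
df = pd.DataFrame({"WEATHER": ["Clear"] * 1500 + ["Raining"] * 499 + ["Sleet"] * 1})

dummies = pd.get_dummies(df["WEATHER"], prefix="WEATHER")

# Drop dummy columns occurring in fewer than 1/1000 of the observations
threshold = len(df) / 1000
dummies = dummies.loc[:, dummies.sum() >= threshold]
```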

Normalization

Finally, we normalize our features, taking care at the same time to create 4 different feature sets, that is, all possible combinations with or without XY_bin and DISTANCE_bin.
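Building and scaling the four feature sets could be sketched as below (toy numeric columns; in the real pipeline the binned features are encoded first):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "XY_bin": [0, 1, 2, 1],
    "DISTANCE_bin": [1, 2, 3, 1],
    "UNDERINFL": [0, 1, 0, 0],
})

# All combinations with / without the two spatial features
feature_sets = {
    "Xboth": df,
    "Xxy": df.drop(columns=["DISTANCE_bin"]),
    "Xdist": df.drop(columns=["XY_bin"]),
    "Xnone": df.drop(columns=["XY_bin", "DISTANCE_bin"]),
}

# z-score normalization of each feature set
scaled = {name: StandardScaler().fit_transform(X) for name, X in feature_sets.items()}
```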

4.4. Classification Model Building

Now it is time to build our models and then use the test set to report their accuracy. We will use the following algorithms:

  • K Nearest Neighbor(KNN)
  • Decision Tree
  • Support Vector Machine
  • Logistic Regression

in all the four different options:

  • with only XY_bin feature,
  • with only DISTANCE_bin feature,
  • with both XY_bin and DISTANCE_bin features
  • with neither the XY_bin nor the DISTANCE_bin feature

K Nearest Neighbor (KNN)

Instead of a train-test split with a simple out-of-sample validation of the result, we will rather use a 10-fold cross-validation:

Instead of using the Grid (or Random) search embedded in the scikit-learn library to optimize the hyperparameter k, we will iterate, calculating different models with different values of k nearest neighbors for the different feature sets (with and without the two spatial features); we will then calculate and plot the corresponding average accuracies based on the cross-validation.

Since these many different tests are computationally expensive, we will use only a sample fraction (10%) of the whole dataset.
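The iteration over k could be sketched as below, on synthetic data standing in for the 10% sample (k runs only to 30 here to keep the sketch quick):

```python
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

# Synthetic binary-classification data standing in for one of the feature sets
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

mean_acc = {}
for k in range(1, 31):
    knn = KNeighborsClassifier(n_neighbors=k)
    # 10-fold cross-validated accuracy, averaged over the folds
    mean_acc[k] = cross_val_score(knn, X, y, cv=10, scoring="accuracy").mean()

# The "elbow" is then read off a plot of mean_acc against k
best_k = max(mean_acc, key=mean_acc.get)
```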

So let's compute the accuracy for k values from 1 to 100, starting with the Xboth feature set:

According to our results, for the Xboth feature set it's better to use a k value equal to 32 (elbow). Thus we choose k to be 32 and we train our model again, this time with all our data.

We go on with the Xxy feature set:

According to our results, for the Xxy feature set it's better to use a k value equal to 24 (elbow). Thus we choose k to be 24 and we train our model again with all our data.

We go on with the Xdist feature set:

According to our results, for Xdist feature set it's better to use a k value equal to 23 (elbow). Thus we choose the k to be 23 and we train again our model, this time with all our data.

And we conclude the knn classifier optimization part with the Xnone feature set:

According to our results, for Xnone feature set it's better to use a k value equal to 30. Thus we choose the k to be 30 and we train again our model with all our data.

Decision Tree

Here again, we will stick to the previous approach: we will iterate, calculating different models with different values of max_depth for our decision trees.
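The max_depth iteration could be sketched as below, again on synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

cv_acc = {}
for depth in range(1, 21):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    # Cross-validated accuracy for each tree depth
    cv_acc[depth] = cross_val_score(tree, X, y, cv=10).mean()
```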

From the above results, it becomes clear that we shouldn't use a Decision Tree classifier. We trained all datasets with increasing depth and charted the performances. As the figure shows, as the depth of the tree increases, the model accuracy (our evaluation metric) clearly decreases. This observation concerns, of course, only the validation and cross-validation datasets. If we measured accuracy on the training set itself, we would certainly observe a severe gap: the tendency there is the opposite, as training accuracy increases with complexity (in our case, max depth). In other words, as the complexity of the decision tree grows with tree depth, overfitting grows too. The fact that validation (CV) accuracy drops significantly as tree depth increases suggests overfitting, which is in any case an instance of one of the critical shortcomings of decision trees that any data scientist should be aware of: decision trees overfit very easily.

Support Vector Machine (SVM)

We will also try the prediction efficiency of the Support Vector Machine algorithm with different kernel functions — radial basis, linear, polynomial and sigmoid:

accuracy_rbf_both: 0.5903684335940845 
accuracy_linear_both: 0.5905596221652152 
accuracy_poly_both: 0.5867006660883227 
accuracy_sigmoid_both: 0.5380693305235253
accuracy_rbf_xy: 0.5900789513856053 
accuracy_linear_xy: 0.5905596221652152 
accuracy_poly_xy: 0.5870865803122312 
accuracy_sigmoid_xy: 0.5250415141688044
accuracy_rbf_dist: 0.5874735184281954 
accuracy_linear_dist: 0.5905596221652152 
accuracy_poly_dist: 0.5846747560344474 
accuracy_sigmoid_dist: 0.5336294553266587
accuracy_rbf_none: 0.5898867389224188 
accuracy_linear_none: 0.5905596221652152 
accuracy_poly_none: 0.5821657550924668 
accuracy_sigmoid_none: 0.5345931239132782
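A sketch of such a kernel comparison, on synthetic data rather than the project's feature sets (the accuracy figures listed above are the notebook's own results; this only illustrates the looping pattern):

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

kernel_acc = {}
for kernel in ["rbf", "linear", "poly", "sigmoid"]:
    svc = SVC(kernel=kernel)
    # Cross-validated accuracy for each kernel function
    kernel_acc[kernel] = cross_val_score(svc, X, y, cv=5).mean()

best_kernel = max(kernel_acc, key=kernel_acc.get)
```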

From the above, it becomes evident that the linear kernel function is the one we should use for our Support Vector Machine classifier. Accordingly, we train our models again with all our data.

Logistic Regression

Finally, we will try a Logistic Regression approach, starting by testing the different solvers for C=0.01:

Accuracy Xboth liblinear: 0.5889210225516879 
Accuracy Xboth newton-cg: 0.5886314472621127 
Accuracy Xboth lbfgs: 0.5887279723586378 
Accuracy Xboth sag: 0.5886314472621127 
Accuracy Xboth saga: 0.5886314472621127
Accuracy Xxy liblinear: 0.5888244043740668 
Accuracy Xxy newton-cg: 0.5893068436945004 
Accuracy Xxy lbfgs: 0.5892103185979752 
Accuracy Xxy sag: 0.5893068436945004 
Accuracy Xxy saga: 0.5893068436945004
Accuracy Xdist liblinear: 0.5895962328218838 
Accuracy Xdist newton-cg: 0.5896928509995049 
Accuracy Xdist lbfgs: 0.5896928509995049 
Accuracy Xdist sag: 0.5896928509995049 
Accuracy Xdist saga: 0.5896928509995049
Accuracy Xnone liblinear: 0.5904644002040338 
Accuracy Xnone newton-cg: 0.5907537893314171 
Accuracy Xnone lbfgs: 0.5907537893314171 
Accuracy Xnone sag: 0.5907537893314171 
Accuracy Xnone saga: 0.5907537893314171
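The solver comparison could be sketched as below, again on synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

solver_acc = {}
for solver in ["liblinear", "newton-cg", "lbfgs", "sag", "saga"]:
    lr = LogisticRegression(C=0.01, solver=solver, max_iter=1000)
    # Cross-validated accuracy for each solver, at fixed C=0.01
    solver_acc[solver] = cross_val_score(lr, X, y, cv=5).mean()
```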

Even though the differences are very small, or precisely because of this, we can safely conclude that we should pick, and try to optimize the performance of, the liblinear solver for all our datasets.

In the meantime, by progressively narrowing the range of C values, we end up investigating the best value for the parameter C in the interval [0.0005, 0.005].

So, it comes up that the ideal C values for all four feature sets are:

both: 0.0036
xy: 0.00145
dist: 0.00355
none: 0.00255

And accordingly we train again our models, this time with all the data.


5. Results and Discussion: Model Evaluation & Selection

Model Evaluation using Test set

In order to evaluate the selected models, we will use the hold-out test dataset, which we must preprocess in the same way (using the same parameters) as the training-validation dataset.

We again drop the same rare features as before (those that count fewer than 1/1000 occurrences), so that the test set has the same columns as the training set.

Algorithm Feature Set Accuracy Jaccard F1-score LogLoss
KNN both 0.59 0.59 0.59 NA
KNN xy 0.57 0.57 0.56 NA
KNN dist 0.57 0.57 0.57 NA
KNN none 0.57 0.57 0.57 NA
SVM both 0.59 0.59 0.59 NA
SVM xy 0.59 0.59 0.59 NA
SVM dist 0.59 0.59 0.59 NA
SVM none 0.59 0.59 0.59 NA
Logistic Regression both 0.59 0.59 0.59 0.67
Logistic Regression xy 0.59 0.59 0.59 0.67
Logistic Regression dist 0.59 0.59 0.59 0.67
Logistic Regression none 0.6 0.6 0.6 0.67
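The metrics in the table could be computed with a sketch like the following (synthetic data, Logistic Regression only; LogLoss needs predicted probabilities, which is why it is NA for the non-probabilistic models):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, jaccard_score, f1_score, log_loss

X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

lr = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = lr.predict(X_test)

acc = accuracy_score(y_test, y_pred)
jac = jaccard_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred, average="weighted")
ll = log_loss(y_test, lr.predict_proba(X_test))  # needs probabilities
```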

From the above, we would prefer the Logistic Regression algorithm, preferably with the dataset that includes no spatial features at all. In any case, the prediction algorithms' performance cannot be considered good. This low performance is even more apparent if we take into account that, for a perfectly balanced binary classification, the worst algorithm, blind guessing, already has an accuracy of 50%. Our future work should concentrate on improving the variables' exploratory analysis for better feature extraction and selection. Mainly, though, we should insist more on feature engineering that exploits the location data. Clustering (DBSCAN and K-means) has its limitations, but there is a lot of interesting and promising work to be done in this direction, so as to create "practical" clusters based on location and car-accident-severity historical data.

6. Conclusion

In this study, our goal was to accurately predict the severity of an accident given its features. The results leave considerable room for improvement, particularly on the class 1 and class 2 predictions. Such models could be very useful in helping weather stations or news programs alert drivers to the probability of a car crash and its likely severity (damage, injuries, fatality, ...).

Appendix

K-MEANS Clustering: Visualization of Clusters


DBSCAN: Visualization of clusters based on location and SEVERITYCODE